Multi-Class Classification
Core Concept
Multi-class classification extends binary classification to handle three or more mutually exclusive categories, where each instance belongs to exactly one class. The model must learn decision boundaries that partition the feature space into multiple regions, one for each class. Examples include digit recognition (0–9), document categorisation by topic, species identification, and medical diagnosis across multiple disease types. This is a more complex setting than binary classification: instead of a single boundary separating two outcomes, the model must distinguish among many alternatives while preserving the constraint that predictions are mutually exclusive.
Key Characteristics
- Multiple decision boundaries – The feature space is partitioned into C regions for C classes. Depending on the algorithm, this may be achieved by learning one boundary per class (e.g. One-vs-Rest), pairwise boundaries (One-vs-One), or a single joint partitioning (e.g. decision trees, neural networks with softmax).
- Decomposition strategies – Many binary algorithms are extended to multi-class via One-vs-Rest (OvR), which trains one classifier per class (that class vs all others) and selects the class with the highest confidence, or One-vs-One (OvO), which trains a binary classifier for every pair of classes, C(C−1)/2 in total, and uses voting for the final prediction. Some algorithms handle multiple classes natively without decomposition (a sketch of both strategies follows this list).
- Softmax and cross-entropy – Neural networks for multi-class classification typically use a softmax output layer to convert logits into a probability distribution over all classes that sums to 1. Training usually employs cross-entropy loss, the negative log-probability assigned to the true class. The predicted class is the one with the highest probability; the full distribution provides uncertainty information (see the NumPy sketch after this list).
- Evaluation metrics – Accuracy remains the overall proportion of correct predictions. Confusion matrices become C×C, showing actual vs predicted class and revealing which classes are commonly confused. Macro-averaging computes a metric per class and takes an unweighted mean, treating all classes equally; micro-averaging pools the underlying counts (true positives, false positives, false negatives) across classes, so frequent classes dominate; weighted averaging averages per-class scores weighted by each class's support, which reflects imbalance when that is desired (see the metrics sketch after this list).
- Class imbalance and hierarchy – Imbalance is more complex with multiple classes: some may be well represented while others are rare (a class-weighting sketch follows this list). Hierarchical classification can help when classes have natural groupings (e.g. first mammal/bird/reptile, then species). Error costs may also differ between class pairs (e.g. misclassifying malignant as benign is usually costlier than the reverse).
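As a concrete illustration of the two decomposition strategies, here is a minimal sketch using scikit-learn's OneVsRestClassifier and OneVsOneClassifier on the bundled digits dataset; the choice of logistic regression as the base estimator is arbitrary:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# 10 classes: the digits 0-9
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base = LogisticRegression(max_iter=1000)  # any binary-capable estimator works

# One-vs-Rest: one classifier per class (that class vs all others)
ovr = OneVsRestClassifier(base).fit(X_train, y_train)

# One-vs-One: one classifier per class pair, C(C-1)/2 = 45 for C = 10
ovo = OneVsOneClassifier(base).fit(X_train, y_train)

print("OvR models:", len(ovr.estimators_), "accuracy:", ovr.score(X_test, y_test))
print("OvO models:", len(ovo.estimators_), "accuracy:", ovo.score(X_test, y_test))
```

OvO trains many more models, but each sees only two classes' worth of data, which can be cheaper for algorithms that scale poorly with training-set size (e.g. kernel SVMs).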
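A minimal NumPy sketch of softmax and cross-entropy; the logits and the true class are made-up values for illustration:

```python
import numpy as np

def softmax(logits):
    # Shift by the max logit for numerical stability; output sums to 1
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def cross_entropy(probs, true_class):
    # Negative log-probability the model assigned to the correct class
    return -np.log(probs[true_class])

logits = np.array([2.0, 1.0, 0.1])         # raw scores for 3 classes (made up)
probs = softmax(logits)
print(probs)                               # ~[0.659, 0.242, 0.099], sums to 1
print(int(np.argmax(probs)))               # predicted class: 0
print(cross_entropy(probs, true_class=0))  # loss ~0.417
```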
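A short sketch of the C×C confusion matrix and the three averaging modes, using scikit-learn's metrics on made-up labels:

```python
from sklearn.metrics import confusion_matrix, f1_score

y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]  # made-up labels; class 2 is most frequent
y_pred = [0, 0, 1, 1, 2, 2, 2, 2, 0, 2]

# 3x3 matrix: rows are actual classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))

print(f1_score(y_true, y_pred, average="macro"))     # unweighted mean over classes
print(f1_score(y_true, y_pred, average="micro"))     # pooled counts; equals accuracy here
print(f1_score(y_true, y_pred, average="weighted"))  # per-class scores weighted by support
```

In a single-label multi-class setting, micro-averaged F1 reduces to plain accuracy, so macro or weighted averages are usually what reveal poor performance on rare classes.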
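One common mitigation for multi-class imbalance is to reweight classes during training. A sketch using scikit-learn's class_weight option on a synthetic imbalanced dataset (the dataset and all parameters here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class problem with one rare class (5% of samples)
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                           weights=[0.80, 0.15, 0.05], random_state=0)

# class_weight="balanced" reweights each class inversely to its frequency,
# so training errors on the rare class are penalised more heavily
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)
print(clf.predict(X[:5]))
```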
Common Applications
- Digit and character recognition – Assigning images or signals to one of 10 digits or a set of character classes
- Document categorisation – Assigning documents to one topic or category from a fixed set (e.g. news, sports, science)
- Species identification – Classifying specimens or images into one of many species or taxa
- Medical diagnosis (multiple conditions) – Determining which of several disease types or conditions is present from patient data
- Intent classification – Mapping user utterances or queries to one of several predefined intents
- Product categorisation – Placing items into a single category in a taxonomy
- Gesture or activity recognition – Classifying signals or video into one of several gestures or activities